Skip to content

feat(gateway): add TOML configuration file (RFC 0003)#1317

Open
TaylorMutch wants to merge 7 commits into
mainfrom
tmutch/gateway-config-impl
Open

feat(gateway): add TOML configuration file (RFC 0003)#1317
TaylorMutch wants to merge 7 commits into
mainfrom
tmutch/gateway-config-impl

Conversation

@TaylorMutch
Copy link
Copy Markdown
Collaborator

Summary

Implements RFC 0003: an opt-in --config / OPENSHELL_GATEWAY_CONFIG flag that loads a TOML file with gateway-wide settings and per-driver tables. Source precedence stays CLI > env > file > built-in default, so existing flags and OPENSHELL_* env vars are not affected when no file is supplied. Includes the matching Helm ConfigMap cutover.

Related Issue

N/A — implements RFC 0003.

Changes

  • Loader (crates/openshell-server/src/config_file.rs): TOML parser with deny_unknown_fields everywhere, version-1 enforcement, env-only rejection of database_url / ssh_handshake_secret, and a driver_table() that overlays inheritable [openshell.gateway] defaults onto each [openshell.drivers.<name>] table.
  • CLI (crates/openshell-server/src/cli.rs): new --config flag; merge_file_into_args() applies file values only when clap's ValueSource::DefaultValue indicates an arg was not supplied; new build_vm_config() / build_docker_config() consume merged driver tables.
  • Driver crates (kubernetes, docker, podman, vm): config structs derive Deserialize/Serialize with struct-level #[serde(default, deny_unknown_fields)]. SupervisorSideloadMethod gains Deserialize with kebab-case rename. Refreshed Default impls where needed.
  • run_server / build_compute_runtime thread Option<&ConfigFile> through; the ad-hoc env reads for Kubernetes/Podman supervisor settings now fold into the file-driven pipeline (env still wins).
  • Helm chart: new templates/gateway-config.yaml ConfigMap rendering gateway.toml from existing .Values.server.* / .Values.service.* / .Values.supervisor.* keys; StatefulSet mounts it at /etc/openshell/gateway.toml, passes --config, and drops the 18 migrated OPENSHELL_* env entries. Only OPENSHELL_SSH_HANDSHAKE_SECRET (Secret-backed) remains.
  • Docs: architecture/gateway.md documents source precedence and the inheritance allowlist; examples/gateway/gateway.example.toml ships a worked example.
  • Inheritance scope (Q4 high-overlap set, per-driver allowlist): default_image, supervisor_image, image_pull_policy, client_tls_secret_name, host_gateway_ip, enable_user_namespaces, guest_tls_ca/cert/key.
  • Follow-up commit removes ssh_handshake_skew_secs / ssh_handshake_secret references from examples and architecture ahead of their planned removal.

Testing

  • mise run pre-commit passes
  • Unit tests added/updated — 10 new tests in config_file.rs (parse, secret rejection, version, unknown field, inheritance merge, override) and 6 new tests in cli.rs (file value applies when defaulted, CLI overrides file, env overrides file, OIDC block, driver inheritance, driver override). Full mise run test passes.
  • E2E tests — test:e2e label applied; the sandbox-infra path should be exercised against the new --config startup mode.

Manual smoke:

  • helm lint . clean, helm template . renders gateway.toml with TLS, with TLS disabled, and with OIDC enabled.
  • TOML loader rejects database_url, ssh_handshake_secret, version = 2, and unknown fields with friendly errors.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Deferred (explicit follow-ups)

  • Helm unit tests for the new ConfigMap (separate effort).
  • "Raw ConfigMap contents" escape hatch in values.yaml.
  • --config-dir directory loader (RFC open question Prototype data access layer that can connect to outlook data #2).
  • RFC schema-gap fixes (Kubernetes ssh fields, Podman 6 missing fields, VM 9+ missing fields).

@TaylorMutch TaylorMutch added the test:e2e Requires end-to-end coverage label May 12, 2026
@github-actions
Copy link
Copy Markdown

Label test:e2e applied for e324f7e. Open the existing run and click Re-run all jobs to execute with the label set. The E2E Gate check on this PR will flip green automatically once the run finishes.

@TaylorMutch TaylorMutch force-pushed the tmutch/gateway-config-impl branch from bf3670f to 02df5f7 Compare May 12, 2026 22:38
Copy link
Copy Markdown
Collaborator

@drew drew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran this through codex, these seem reasonable to consider. "Restart pods when the rendered ConfigMap changes" feels possibly out of scope.

Review Comments

[P1] Move Helm grpc_endpoint into a supported table

/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/deploy/helm/openshell/templates/gateway-config.yaml:36

The generated Helm config always writes grpc_endpoint under [openshell.gateway], but GatewayFileSection uses deny_unknown_fields and has no grpc_endpoint field. Default chart installs will fail during config_file::load before the gateway starts; this needs to move under
[openshell.drivers.kubernetes] or be added to the gateway schema and merge path.

[P1] Keep Kubernetes imagePullPolicy empty by default

/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/crates/openshell-driver-kubernetes/src/config.rs:76

When no Kubernetes image pull policy is configured, this now defaults to missing, which is the Podman vocabulary and is not a valid Kubernetes imagePullPolicy. Because the driver emits the field whenever it is non-empty, default Kubernetes deployments will create pods rejected by the
API; keep this default empty so Kubernetes applies its own default.

[P1] Preserve driver-table gRPC endpoints

/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/crates/openshell-server/src/lib.rs:508

When a TOML file sets [openshell.drivers.kubernetes].grpc_endpoint, kubernetes_config_from_file reads it and this assignment immediately overwrites it with config.grpc_endpoint, which can only come from CLI/env in this patch. File-only Kubernetes configs therefore pass an empty
OPENSHELL_ENDPOINT into sandboxes, so the callback path breaks even though the driver table contains the endpoint.

[P2] Keep disableTls wired after moving config to TOML

/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/deploy/helm/openshell/templates/statefulset.yaml:49

When .Values.server.disableTls is true, the ConfigMap omits the TLS section, but this args/env block no longer passes --disable-tls or OPENSHELL_DISABLE_TLS. RunArgs.disable_tls remains false, so startup still requires TLS cert/key/CA paths and the plaintext Helm option fails.

[P2] Restart pods when the rendered ConfigMap changes

/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/deploy/helm/openshell/templates/statefulset.yaml:65

After moving settings such as log level, sandbox image, TLS paths, and driver options into this ConfigMap, Helm upgrades that change those values no longer change the StatefulSet pod template. The gateway only reads /etc/openshell/gateway.toml at startup, so without a checksum
annotation or similar rollout trigger, pods keep running with stale config.

[P2] Honor health and metrics listener addresses

/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/crates/openshell-server/src/cli.rs:511

For config files that set health or metrics on a different interface than the main listener, these lines copy only the port and later build the listener with args.bind_address. For example, a health listener intended for loopback can be exposed on the public bind address; preserve the
full SocketAddr from the TOML file or validate that separate IPs are unsupported.

[P2] Apply ssh_session_ttl_secs from the TOML file

/Users/anewberry/dev/openshell.tmutch-gateway-config-impl/crates/openshell-server/src/config_file.rs:101

The loader accepts ssh_session_ttl_secs, but merge_file_into_args and run_from_args never apply it to openshell_core::Config with with_ssh_session_ttl_secs. Operators can set the documented key and get a successful parse while session tokens still use the built-in default TTL.

@TaylorMutch
Copy link
Copy Markdown
Collaborator Author

Sounds good, agree these look reasonable to me.

Comment thread examples/gateway/docker.example.toml Outdated
Comment thread deploy/helm/openshell/templates/statefulset.yaml
Introduces an opt-in --config / OPENSHELL_GATEWAY_CONFIG flag that loads a
TOML file with gateway-wide settings and per-driver tables. Source
precedence is CLI > env > file > built-in default, implemented via clap's
ValueSource so existing flags and env vars keep their priority.

Driver crates (kubernetes, docker, podman, vm) now derive Deserialize on
their config structs. SupervisorSideloadMethod gains Deserialize with
kebab-case rename. A per-driver inheritance allowlist on the loader side
overlays [openshell.gateway] shared defaults (default_image,
supervisor_image, image_pull_policy, guest_tls_*, ssh_handshake_skew_secs,
client_tls_secret_name, host_gateway_ip, enable_user_namespaces) onto
each [openshell.drivers.<name>] table before deserialization.

The Helm chart renders a new gateway-config ConfigMap and mounts it at
/etc/openshell/gateway.toml. The migrated OPENSHELL_* env entries are
dropped from the StatefulSet — only the Secret-backed
OPENSHELL_SSH_HANDSHAKE_SECRET remains. database_url stays on --db-url.

Adds examples/gateway/gateway.example.toml and updates architecture/gateway.md
with the source precedence and inheritance rules.
…from examples

Both fields are scheduled for removal. Remove the example values and the
env-only note so the gateway.toml example and the architecture doc stop
recommending settings that will not exist much longer.
Adds focused single-driver examples next to the comprehensive
gateway.example.toml: kubernetes, docker, podman, and microvm. Each one
demonstrates the realistic settings for that driver plus how shared
[openshell.gateway] defaults inherit into the driver table.

A new unit test (`checked_in_examples_parse`) loads every example through
the config_file loader so schema drift fails CI rather than silently
shipping a broken example.
Kubernetes and Podman use mutually-incompatible vocabularies for the same
TOML key:

  - Kubernetes: `Always | IfNotPresent | Never` (free-form string passed
    verbatim to the K8s API).
  - Podman: `always | missing | never | newer` (strict lowercase enum
    deserialised into `ImagePullPolicy`).

No value means the same thing in both drivers. Sharing the key at
`[openshell.gateway]` scope and inheriting it into every active driver's
table meant any value safe for one driver was either wrong or silently
dropped for the other (`IfNotPresent` → `ImagePullPolicy::Missing` after
`.unwrap_or_default()`). Operators run one driver per gateway, so the
"shared default" never pays for itself.

Make `image_pull_policy` driver-local:

  - Remove the field from `GatewayFileSection` and from
    `inheritable_keys()` for both Kubernetes and Podman.
  - Drop the file→`RunArgs` merge for the gateway-scope key.
  - Stop unconditionally clobbering the driver value with
    `config.sandbox_image_pull_policy` in the runtime wiring — only apply
    the CLI/env override when it was set (and, for Podman, only when it
    parses into the lowercase enum).
  - Move the key under `[openshell.drivers.kubernetes]` and
    `[openshell.drivers.podman]` in every example, the RFC, the
    architecture doc, and the Helm-rendered gateway ConfigMap.

The supervisor pull policy follows the same shape: it is K8s-only and
moves into `[openshell.drivers.kubernetes]` alongside `image_pull_policy`
in the Helm template.
Resolves the P1 and P2 issues raised in PR #1317:

- Helm gateway ConfigMap moves `grpc_endpoint` under
  `[openshell.drivers.kubernetes]` so the default install no longer fails
  the gateway's `deny_unknown_fields` schema check.
- `kubernetes_config_from_file` and `podman_config_from_file` only let
  the gateway-wide CLI/env `grpc_endpoint` overwrite the driver-table
  value when it was actually supplied, preserving file-only configs.
- Kubernetes driver default `image_pull_policy` is now empty (was Podman
  vocabulary "missing"), so default deployments let the Kubernetes API
  apply its own policy instead of being rejected.
- New `disable_tls` gateway field plumbs `.Values.server.disableTls`
  through the TOML ConfigMap instead of relying on env vars dropped from
  the StatefulSet.
- StatefulSet pod template now carries a `checksum/gateway-config`
  annotation so `helm upgrade` rolls pods when the ConfigMap changes.
- Auxiliary listener resolution preserves the full `SocketAddr` from
  `health_bind_address` / `metrics_bind_address`, so a loopback-pinned
  health port is not silently relocated onto the public bind address.
- `ssh_session_ttl_secs` from the file is now applied to `Config` (it
  was previously accepted by the loader but never read).

New regression coverage: cli-level merge tests for the new fields plus
helm-unittest assertions for the ConfigMap shape, checksum annotation,
and `disable_tls` rendering.
Replaces the per-driver example files under examples/gateway/ with a
single published reference page at docs/reference/gateway-config.mdx
covering source precedence, layout, the full example, and the four
per-driver examples (Kubernetes, Docker, Podman, microVM). Drew flagged
during PR #1317 review that the examples belong with the user-facing
docs rather than in a sibling examples/ directory.

The cross-references in architecture/gateway.md and RFC 0003 are updated
to point at the new docs page; the round-trip test in config_file.rs is
removed (schema coverage stays on the inline parses_full_example test
and per-field merge tests — doc-snippet drift belongs in a separate
docs-lint, not in a cross-tree Rust unit test).
@github-actions
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants